CPSC 545/445 (Autumn 2003) - Class 11: Gene Finding (1)

Module 4, Part 1

---

4.1 Gene finding - background / motivation

increasing amount of genomic sequence data
-> interpretation of this data is lagging behind.

for genomic DNA sequence data from higher eukaryotes:
computational gene finding, i.e., identifying intron/exon structures,
is one of the main problems of bioinformatics.

the problem is closely related to the fundamental biochemical issue of
specifying the precise determinants of:
- transcription
- translation
- RNA splicing

the problem is also of significant practical importance, as computer
software for gene finding is routinely used by genome sequencing
laboratories to help identify genes in newly sequenced regions.

--

recap: gene structure

- genes of most eukaryotes are neither continuous nor contiguous:
  - they are separated by long stretches of intergenic DNA
  - their coding sequences are interrupted by non-coding introns
- only a small part of the genome is coding sequence (human: ca. 3%)
- alternative splicing (at least 35% of human genes - Mironov et al., 1999)
- nested genes (Dunham et al., 1999)
- overlapping genes on the same or opposite strands
  (Schulz and Butler, 1989; Ashburner et al., 1999; Cooper et al., 1998)
- pseudogenes = non-functional sequences resembling real genes;
  they occur in numerous copies throughout the genome

regulatory regions are crucial for gene expression;
their location relative to the target gene is not uniquely determined:
- basic regulatory elements (e.g., TATA / CAAT boxes) are usually
  upstream of the transcription start site
- enhancers and silencers can be distant: upstream, downstream,
  even within introns

---

4.2 Approaches to computational gene finding

the computational gene finding problem

given raw sequence data, predict:
- coding and non-coding regions
- exons/introns
- splicing patterns
- transcription factors
- ...
-> genomic sequence annotation

[slide: Figure 1]

--

naive approach:
- search for characteristic subsequences (pattern matching)
  (e.g., TATA, CCAAT, GC boxes, etc.)

example: intron identification (GT-AG rule)

[slide: Figure 31.2]

problems:
- only covers the most conserved signals (subsequences)
- these are not sufficient for characterising genes (exons)

--

ideal approach:
- complete simulation of DNA transcription, RNA splicing/processing,
  RNA translation
- base gene prediction on the results of this simulation

problems:
- assumes solutions to the most important open biochemical problems
- computational complexity might be very high

--

finding open reading frames (ORFs):
- mark all stop codons in all reading frames
- long stretches of uninterrupted sequence between stop codons
  give candidates for genes (exons)
- for prokaryotes, also use initiation codon ATG

[slide: Figure 7-15]

problems:
- distribution of ORF length is almost identical for random sequence data
  and real genomic sequence (Claverie et al., 1997)
  -> ORF length alone gives almost no information on protein coding regions

--

how is it really done?

- signal detection (splice sites, promoter regions, etc.)
  note: signals are often not 100% conserved
  example: intron 5' splice site
  -> probabilistic models (as in phylogeny!)

  [slide: Figure 7-15]
  [slide: Figure 1.8 - intron 5' splice site]

- compositional properties of coding vs.
  non-coding DNA (GC content, hexamer frequencies)
- integration of the above
- integration with homology searching

-> modern gene finders predict individual functional elements
   and entire gene structures (sets of spliceable exons)

--

signal detection:
- simple motif search: typically not good enough
- Weight Matrix Method (WMM)
- Weight Array Method (WAM)
- Hidden Markov Models (HMMs)

--

fundamental method:
- build generative, probabilistic models of signals
- use these to compute the probability of a given sequence
- high probability of generation -> predict signal

---

4.3 Weight Matrix and Weight Array Methods

WMM:

given:
- frequency p^i_j of nucleotide j at position i
- sequence X = x1 x2 x3 ... xn

probability P{X} of generating X = p^1_x1 * p^2_x2 * ... * p^n_xn
(positions are treated as independent of each other)

--

generalisation: WAM, models dependencies between adjacent positions
(probability of seeing x2 at pos 2 depends on what was seen at pos 1)

--

where do the model parameters (p^i_j) come from?
- manual determination (human experience)
- training: learned from aligned sequence data of known signals
  (problem: bias towards known data, poor performance for unknown
  sequence data)

--

note: WMM/WAM can still be too weak to reliably detect signals
such as intron/exon boundaries

[slide: Fig. 1]

---

Resources:

D. Haussler. Computational Genefinding.
  [online, from his webpage: www.cse.ucsc.edu/~haussler]

J.-M. Claverie, O. Poirot, F. Lopez. The difficulty of identifying genes
  in anonymous vertebrate sequences. Computers Chem. 21(4): 203-214, 1997.

J.W. Fickett. The gene identification problem: an overview for developers.
  Computers Chem. 20(1): 103-118, 1996.

R. Guigo. Computational gene identification: an open problem.
  Computers Chem. 21(4): 215-222, 1997.

---
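
sketch for 4.3 (WMM / WAM training and scoring):

the steps described in 4.3 (estimate p^i_j from aligned examples of a known signal, then multiply the per-position probabilities) can be illustrated in Python roughly as follows. assumptions that are not from the lecture: all function names, the made-up toy 5' splice-site-like sequences, the pseudocount of 1 (so that unseen nucleotides do not get probability 0), and restricting the WAM to first-order (adjacent-position) dependencies. this is a minimal sketch under those assumptions, not a production signal detector.

```python
from collections import Counter

ALPHABET = "ACGT"

# Made-up toy training data (hypothetical aligned signal instances).
TOY_SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]


def _check_aligned(aligned):
    """Return the common length; raise if sequences are missing or unequal."""
    if not aligned:
        raise ValueError("need at least one training sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all training sequences must have the same length")
    return length


def train_wmm(aligned, pseudocount=1.0):
    """Estimate p^i_j: one dict (nucleotide -> frequency) per position i."""
    length = _check_aligned(aligned)
    total = len(aligned) + pseudocount * len(ALPHABET)
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned)
        matrix.append({j: (counts[j] + pseudocount) / total for j in ALPHABET})
    return matrix


def wmm_probability(matrix, x):
    """P{X} = p^1_x1 * p^2_x2 * ... * p^n_xn (positions independent)."""
    if len(x) != len(matrix):
        raise ValueError("sequence length must match the model length")
    p = 1.0
    for column, nucleotide in zip(matrix, x):
        p *= column[nucleotide]
    return p


def train_wam(aligned, pseudocount=1.0):
    """First-order WAM: position 1 uses plain frequencies; each later
    position uses frequencies conditioned on the preceding nucleotide."""
    length = _check_aligned(aligned)
    first = train_wmm([s[0] for s in aligned], pseudocount)[0]
    conditional = []
    for i in range(1, length):
        pairs = Counter((s[i - 1], s[i]) for s in aligned)
        table = {}
        for prev in ALPHABET:
            total = sum(pairs[(prev, j)] for j in ALPHABET)
            total += pseudocount * len(ALPHABET)
            table[prev] = {
                j: (pairs[(prev, j)] + pseudocount) / total for j in ALPHABET
            }
        conditional.append(table)
    return first, conditional


def wam_probability(model, x):
    """P{X} = p(x1) * p(x2 | x1) * ... * p(xn | x(n-1))."""
    first, conditional = model
    if len(x) != len(conditional) + 1:
        raise ValueError("sequence length must match the model length")
    p = first[x[0]]
    for i, table in enumerate(conditional, start=1):
        p *= table[x[i - 1]][x[i]]
    return p


if __name__ == "__main__":
    wmm = train_wmm(TOY_SITES)
    wam = train_wam(TOY_SITES)
    for candidate in ("GTAAGT", "CCCCCC"):
        print(candidate, wmm_probability(wmm, candidate),
              wam_probability(wam, candidate))
```

a candidate would be predicted as a signal when its probability is high. in practice the product is usually computed as a sum of logarithms (to avoid numerical underflow on longer signals) and compared against a background model; both are left out here to stay close to the formula in the notes.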